Final Project_Shawkat

Author

Shawkat Ali

Published

March 8, 2025

Check in 1:

Background

Mental health challenges such as anxiety, depression, and stress among students in higher education are becoming a significant public health concern. A growing body of research highlights the socioeconomic and demographic factors that influence students’ mental health outcomes. For example, a study of college students in the USA found that mental health issues were associated with factors such as sex, race, ethnicity, religiosity, relationship status, living on campus, and financial situation (Eisenberg, Hunt, & Speer, 2013). Additionally, research suggests that female students are more likely than their male counterparts to experience high levels of stress and depression (Hyde & Grabe, 2008; Matud, 2004). These differences are often linked to biological, psychological, and social factors, including coping mechanisms (Nolen-Hoeksema, 2012). Given these complexities, it is important to explore how gender influences students’ depression levels.

This research question is both relevant and timely, as mental health concerns among university students continue to rise. Factors such as academic pressure, social expectations, financial burdens, and job uncertainties contribute to these challenges (Beiter et al., 2015). By examining gender differences in depression, this study can add to the existing literature on student well-being and help shape interventions that address the specific needs of male and female students. Understanding these dynamics is particularly important for universities and policymakers in Bangladesh, as it can guide the development of gender-sensitive mental health programs, promote awareness, and establish effective support systems for students.

Research Question

How does gender influence depression level among university students?

Hypothesis

Female university students experience higher levels of depression than male students.

The hypothesis that female university students experience higher levels of stress, anxiety, and depression than male students has been tested and supported by studies in other parts of the world. For example:

  • A meta-analysis by Eisenberg et al. (2007) found that female college students were more likely to experience anxiety and depression than male students.

  • A study by Bayram & Bilgel (2008) in Turkey also showed that female students had significantly higher levels of depression and anxiety compared to male students.

However, most of these studies were conducted in Western countries or in a small number of non-Western settings. Testing the hypothesis here will help us understand whether this gender disparity in mental health problems (MHPs) also exists in a different socioeconomic and cultural context, namely Bangladesh.

About the data

This dataset offers insight into the Mental Health Problems (MHPs) of university students, specifically assessing stress, anxiety, and depression among students from 15 universities in Bangladesh. The dataset contains 2,028 student responses, collected from 9 public and 6 private universities.

To measure the level of mental health problems, the study employs three well-established psychological scales:

  1. GAD-7 (Generalized Anxiety Disorder-7): Assesses levels of anxiety.
  2. PSS-10 (Perceived Stress Scale-10): Measures stress levels.
  3. PHQ-9 (Patient Health Questionnaire-9): Evaluates symptoms of depression.
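
As a rough illustration of how a scale score such as the PHQ-9 total is formed, the sketch below sums the nine item responses of the first respondent shown in the data preview (columns PHQ1 to PHQ9) and assigns a severity band. The cut-offs are the conventional PHQ-9 bands and are an assumption here; the dataset's own label strings may be worded differently, and the names `phq_items`, `phq_total`, and `phq_band` are only illustrative.

```r
# Item responses for the first respondent in the data preview (PHQ1 to PHQ9),
# each scored from 0 to 3.
phq_items <- c(2, 2, 3, 2, 2, 2, 2, 3, 2)

# The PHQ-9 total is the plain sum of the nine items (possible range 0 to 27).
phq_total <- sum(phq_items)

# Conventional PHQ-9 severity bands (an assumption in this sketch; the
# dataset's label wording may differ).
phq_band <- cut(
  phq_total,
  breaks = c(-Inf, 4, 9, 14, 19, 27),
  labels = c("Minimal", "Mild", "Moderate", "Moderately severe", "Severe")
)

print(phq_total)
print(as.character(phq_band))
```

The total of 20 agrees with the Depression.Value recorded for that respondent in the preview.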

Alongside mental health assessments, the dataset includes sociodemographic variables such as age, gender, academic background, and university type (public/private). This enables a comprehensive analysis of the factors influencing students’ mental health.

The data (Syeed et al., 2024) were collected through an online Google Forms survey, circulated via faculty representatives across the 15 universities. A team of five professors and a student psychologist ensured the adoption and validation of the three mental health scales. The survey was carefully designed to ensure internal consistency, reliability, and a sufficient sample size for meaningful analysis.

The questionnaire was divided into several sections, each focusing on different aspects of students’ mental health and academic experiences. The key variables of interest include:

  1. Age: Categorized into ranges (e.g., Below 18, 18-22, 23-26, etc.).
  2. Gender: Options included Male, Female, and Prefer not to say.
  3. University: A list of universities in Bangladesh was provided for selection.
  4. Department: Various academic departments were listed (e.g., Engineering, Business, Environmental Sciences, etc.).
  5. Academic Year: Ranging from First Year to Fourth Year or equivalent.
  6. Current CGPA: Ranges from Below 2.50 to 3.80 - 4.00.
  7. Scholarship/Waiver: Whether the student received a waiver or scholarship.
  8. Anxiety value
  9. Anxiety label
  10. Depression value
  11. Depression label
  12. Stress value
  13. Stress label
library(ggplot2)
mhp <- read.csv("data/MHP_Processed.csv")
head(mhp, n = 5)
    Age Gender                                          University
1 18-22 Female            Independent University, Bangladesh (IUB)
2 18-22   Male            Independent University, Bangladesh (IUB)
3 18-22   Male American International University Bangladesh (AIUB)
4 18-22   Male American International University Bangladesh (AIUB)
5 18-22   Male                        North South University (NSU)
                                    Department             Academic_Year
1 Engineering - CS / CSE / CSC / Similar to CS Second Year or Equivalent
2 Engineering - CS / CSE / CSC / Similar to CS  Third Year or Equivalent
3 Engineering - CS / CSE / CSC / Similar to CS  Third Year or Equivalent
4 Engineering - CS / CSE / CSC / Similar to CS  Third Year or Equivalent
5 Engineering - CS / CSE / CSC / Similar to CS Second Year or Equivalent
  Current_CGPA waiver_or_scholarship PSS1 PSS2 PSS3 PSS4 PSS5 PSS6 PSS7 PSS8
1  2.50 - 2.99                    No    3    4    3    2    2    1    2    2
2  3.00 - 3.39                    No    3    3    4    2    3    2    2    2
3  3.00 - 3.39                    No    0    0    0    0    0    1    0    0
4  3.00 - 3.39                    No    3    1    2    1    4    3    2    2
5  2.50 - 2.99                    No    4    4    4    2    2    2    0    2
  PSS9 PSS10 Stress.Value          Stress.Label GAD1 GAD2 GAD3 GAD4 GAD5 GAD6
1    4     4           29 High Perceived Stress    2    2    3    2    2    2
2    2     3           24       Moderate Stress    1    2    2    1    1    3
3    0     0           15       Moderate Stress    0    0    0    0    0    0
4    3     2           17       Moderate Stress    2    1    1    1    2    1
5    4     4           32 High Perceived Stress    3    0    3    3    1    1
  GAD7 Anxiety.Value    Anxiety.Label PHQ1 PHQ2 PHQ3 PHQ4 PHQ5 PHQ6 PHQ7 PHQ8
1    2            15   Severe Anxiety    2    2    3    2    2    2    2    3
2    2            12 Moderate Anxiety    3    2    2    2    2    2    2    2
3    0             0  Minimal Anxiety    0    0    0    0    0    0    0    0
4    2            10 Moderate Anxiety    2    1    2    1    2    1    2    2
5    3            14 Moderate Anxiety    1    3    3    3    1    3    0    3
  PHQ9 Depression.Value             Depression.Label
1    2               20            Severe Depression
2    2               19 Moderately Severe Depression
3    0                0                No Depression
4    1               14          Moderate Depression
5    3               20            Severe Depression
library(car)
Loading required package: carData
library(dplyr)

Attaching package: 'dplyr'
The following object is masked from 'package:car':

    recode
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Checking and cleaning the data

mhp <- mhp |> 
    select(
    Gender,
    Academic_Year,
    Current_CGPA,
    waiver_or_scholarship,
    Anxiety.Value,
    Depression.Value,
    Stress.Value
  )

mhp <- mhp |> 
  rename(
    CGPA = Current_CGPA,
    Scholarship_Waiver = waiver_or_scholarship,
    Anxiety = Anxiety.Value,
    Depression = Depression.Value,
    Stress = Stress.Value
  )
# Recoding the "Prefer not to say" value as NA in Gender
# Recoding the "Other" value as NA in CGPA
# Recoding the "Other" value as NA in Academic_Year

mhp <- mhp |> 
  mutate(Gender = na_if(Gender, "Prefer not to say"),
         CGPA = na_if(CGPA, "Other"),
         Academic_Year = na_if(Academic_Year, "Other"))

mhp <- mhp |> 
  filter(!is.na(Gender) & !is.na(CGPA) & !is.na(Academic_Year))

mhp <- mhp |> 
  mutate(
    Academic_Year = dplyr::recode(as.character(Academic_Year),
      "First Year or Equivalent" = "First Year",
      "Second Year or Equivalent" = "Second Year",
      "Third Year or Equivalent" = "Third Year",
      "Fourth Year or Equivalent" = "Fourth Year"
    ),
    Academic_Year = factor(Academic_Year, levels = c("First Year", "Second Year", "Third Year", "Fourth Year"), ordered = TRUE)
  )

mhp <- mhp |> 
  mutate(
    CGPA = dplyr::recode(as.character(CGPA),
      "Below 2.50" = 2.00,                  
      "2.50 - 2.99" = 2.50,
      "3.00 - 3.39" = 3.00,
      "3.40 - 3.79" = 3.40,
      "3.80 - 4.00" = 3.80
    ),
    CGPA = as.numeric(CGPA)
  )

write.csv(mhp, "mhp_cleaned.csv", row.names = FALSE)

The data cleaning and processing steps involved several key transformations to prepare the data set for analysis. First, the data set was subset to include only relevant variables (Gender, Academic_Year, Current_CGPA, waiver_or_scholarship, and the Anxiety, Depression, and Stress scores), which were then renamed for clarity. Next, ambiguous responses (“Prefer not to say” in Gender and “Other” in CGPA and Academic_Year) were recoded as NA and the affected rows were removed, which reduced the sample from 2,028 to 1,782 responses. The Academic_Year values were standardized (e.g., “First Year or Equivalent” became “First Year”) and converted into an ordered factor for meaningful comparisons. Similarly, CGPA was recoded from categorical ranges to a single number per range and stored as a numeric variable. The number used is the lower bound of each range (e.g., “2.50 - 2.99” became 2.50), with “Below 2.50” coded as 2.00; these are not midpoints, so the recoded values slightly understate the typical CGPA within each range. Finally, the cleaned data set was saved as a CSV file, mhp_cleaned.csv. These steps ensured the data set was tidy, with consistent formatting and no extraneous or ambiguous entries, making it suitable for statistical analysis.
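
The CGPA recoding can also be written as a lookup with a named vector. The sketch below is only an illustration on a toy vector: it contrasts the lower-bound coding used in the cleaning code with a midpoint coding that is not used in this analysis. The names `cgpa_raw`, `lower_bound`, and `midpoint` are hypothetical, and the midpoint assigned to the open-ended “Below 2.50” category is an arbitrary placeholder.

```r
# Toy vector of CGPA categories as they appear in the survey.
cgpa_raw <- c("2.50 - 2.99", "3.00 - 3.39", "Below 2.50", "3.80 - 4.00", "Other")

# Lower-bound coding, matching the values used in the cleaning step.
lower_bound <- c("Below 2.50" = 2.00, "2.50 - 2.99" = 2.50, "3.00 - 3.39" = 3.00,
                 "3.40 - 3.79" = 3.40, "3.80 - 4.00" = 3.80)

# Midpoint coding (an alternative, not what the analysis uses). The open-ended
# "Below 2.50" category has no true midpoint, so 2.25 is a placeholder.
midpoint <- c("Below 2.50" = 2.25, "2.50 - 2.99" = 2.745, "3.00 - 3.39" = 3.195,
              "3.40 - 3.79" = 3.595, "3.80 - 4.00" = 3.90)

# Indexing a named vector by the category strings performs the recoding;
# categories missing from the lookup (such as "Other") become NA.
cgpa_lower <- unname(lower_bound[cgpa_raw])
cgpa_mid   <- unname(midpoint[cgpa_raw])

print(data.frame(cgpa_raw, cgpa_lower, cgpa_mid))
```

Either coding keeps the ordering of the categories, so regression results that use CGPA as a control would be expected to change little; the choice mainly affects how the CGPA mean is read.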

Analyzing the descriptive statistics

head(mhp, n = 4)
  Gender Academic_Year CGPA Scholarship_Waiver Anxiety Depression Stress
1 Female   Second Year  2.5                 No      15         20     29
2   Male    Third Year  3.0                 No      12         19     24
3   Male    Third Year  3.0                 No       0          0     15
4   Male    Third Year  3.0                 No      10         14     17
glimpse(mhp)
Rows: 1,782
Columns: 7
$ Gender             <chr> "Female", "Male", "Male", "Male", "Male", "Male", "…
$ Academic_Year      <ord> Second Year, Third Year, Third Year, Third Year, Se…
$ CGPA               <dbl> 2.5, 3.0, 3.0, 3.0, 2.5, 3.0, 3.0, 3.4, 3.4, 3.4, 3…
$ Scholarship_Waiver <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…
$ Anxiety            <int> 15, 12, 0, 10, 14, 5, 4, 13, 6, 13, 9, 15, 17, 14, …
$ Depression         <int> 20, 19, 0, 14, 20, 3, 3, 19, 2, 9, 9, 3, 19, 20, 9,…
$ Stress             <int> 29, 24, 15, 17, 32, 18, 8, 24, 18, 23, 29, 20, 24, …
summary(mhp)
    Gender              Academic_Year      CGPA       Scholarship_Waiver
 Length:1782        First Year :444   Min.   :2.000   Length:1782       
 Class :character   Second Year:374   1st Qu.:2.500   Class :character  
 Mode  :character   Third Year :565   Median :3.000   Mode  :character  
                    Fourth Year:399   Mean   :3.076                     
                                      3rd Qu.:3.400                     
                                      Max.   :3.800                     
    Anxiety        Depression        Stress    
 Min.   : 0.00   Min.   : 0.00   Min.   : 0.0  
 1st Qu.: 8.00   1st Qu.: 9.00   1st Qu.:19.0  
 Median :13.00   Median :15.00   Median :22.0  
 Mean   :12.36   Mean   :14.43   Mean   :22.9  
 3rd Qu.:17.00   3rd Qu.:19.00   3rd Qu.:27.0  
 Max.   :21.00   Max.   :27.00   Max.   :40.0  
str(mhp)
'data.frame':   1782 obs. of  7 variables:
 $ Gender            : chr  "Female" "Male" "Male" "Male" ...
 $ Academic_Year     : Ord.factor w/ 4 levels "First Year"<"Second Year"<..: 2 3 3 3 2 1 1 1 3 2 ...
 $ CGPA              : num  2.5 3 3 3 2.5 3 3 3.4 3.4 3.4 ...
 $ Scholarship_Waiver: chr  "No" "No" "No" "No" ...
 $ Anxiety           : int  15 12 0 10 14 5 4 13 6 13 ...
 $ Depression        : int  20 19 0 14 20 3 3 19 2 9 ...
 $ Stress            : int  29 24 15 17 32 18 8 24 18 23 ...
ggplot(mhp, aes(x = CGPA, y = Stress, color = Gender)) +
  geom_point(alpha = 0.4)

The cleaned dataset includes 1,782 university students, described by gender (character type), academic year (First Year: 444, Second Year: 374, Third Year: 565, Fourth Year: 399), and scholarship or waiver status (character type). The recoded CGPA values range from 2.0 to 3.8, with a median of 3.0 and a mean of 3.08. Anxiety scores range from 0 to 21 (median 13, mean 12.36), depression scores from 0 to 27 (median 15, mean 14.43), and stress scores from 0 to 40 (median 22, mean 22.9). Because the three scales have different ranges, their raw scores are not directly comparable. For anxiety and depression the mean sits slightly below the median, which hints at a mild left skew rather than perfect symmetry. Stress has the widest range but the narrowest interquartile range (19 to 27, against 8 to 17 for anxiety and 9 to 19 for depression), so the middle half of students fall within an 8-point band. These statistics provide a baseline for further analysis of mental health trends across demographics.

# Gender vs Anxiety.Value plot
ggplot(mhp, aes(x = Gender, y = Anxiety, fill = Gender)) +
  geom_boxplot() +
  labs(
    title = "Box Plot of Anxiety Value by Gender",
    x = "Gender",
    y = "Anxiety Value"
  )

# Gender vs Depression.Value boxplot
ggplot(mhp, aes(x = Gender, y = Depression, fill = Gender)) +
  geom_boxplot(width = 0.5) + 
  labs(
    title = "Box Plot of Depression Value by Gender",
    x = "Gender",
    y = "Depression Value"
  ) 

# Gender vs Stress.Value box plot
ggplot(mhp, aes(x = Gender, y = Stress, fill = Gender)) +
  geom_boxplot(width = 0.5) + 
  labs(
    title = "Box Plot of Stress Value by Gender",
    x = "Gender",
    y = "Stress Value"
  ) 

# Anxiety.Value vs Depression.Value scatter plot
ggplot(mhp, aes(x = Anxiety, y = Depression)) +
  geom_point(alpha = 0.2, color = "blue") +  
  geom_smooth(method = "lm", color = "lightblue", se = FALSE) +  
  labs(
    title = "Anxiety vs. Depression Scatter Plot",
    x = "Anxiety Value",
    y = "Depression Value"
  ) 
`geom_smooth()` using formula = 'y ~ x'

# Anxiety.Value vs Stress.Value scatter plot
ggplot(mhp, aes(x = Anxiety, y = Stress)) +
  geom_point(alpha = 0.4, color = "orange") +  
  geom_smooth(method = "lm", color = "lightblue", se = FALSE) +  
  labs(
    title = "Anxiety vs. Stress Scatter Plot",
    x = "Anxiety Value",
    y = "Stress Value"
  ) 
`geom_smooth()` using formula = 'y ~ x'

# Depression.Value vs Stress.Value Scatter Plot
ggplot(mhp, aes(x = Depression, y = Stress)) +
  geom_point(alpha = 0.4, color = "lightgreen") +  
  geom_smooth(method = "lm", color = "lightblue", se = FALSE) +  
  labs(
    title = "Depression vs. Stress Scatter Plot",
    x = "Depression Value",
    y = "Stress Value"
  ) 
`geom_smooth()` using formula = 'y ~ x'

Hypothesis testing

To test the hypothesis that female university students experience higher levels of depression compared to male students, we will perform an independent samples t-test. This test compares the mean depression scores between two independent groups (male and female students). 

Null Hypothesis (H₀): There is no difference in depression levels between female and male students (μ_female = μ_male).

Alternative Hypothesis (H₁): Female students have higher depression levels than male students (μ_female > μ_male)

# Inspecting the data and the distribution of depression scores by gender before testing
glimpse(mhp)
Rows: 1,782
Columns: 7
$ Gender             <chr> "Female", "Male", "Male", "Male", "Male", "Male", "…
$ Academic_Year      <ord> Second Year, Third Year, Third Year, Third Year, Se…
$ CGPA               <dbl> 2.5, 3.0, 3.0, 3.0, 2.5, 3.0, 3.0, 3.4, 3.4, 3.4, 3…
$ Scholarship_Waiver <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…
$ Anxiety            <int> 15, 12, 0, 10, 14, 5, 4, 13, 6, 13, 9, 15, 17, 14, …
$ Depression         <int> 20, 19, 0, 14, 20, 3, 3, 19, 2, 9, 9, 3, 19, 20, 9,…
$ Stress             <int> 29, 24, 15, 17, 32, 18, 8, 24, 18, 23, 29, 20, 24, …
head(mhp)
  Gender Academic_Year CGPA Scholarship_Waiver Anxiety Depression Stress
1 Female   Second Year  2.5                 No      15         20     29
2   Male    Third Year  3.0                 No      12         19     24
3   Male    Third Year  3.0                 No       0          0     15
4   Male    Third Year  3.0                 No      10         14     17
5   Male   Second Year  2.5                 No      14         20     32
6   Male    First Year  3.0                 No       5          3     18
ggplot(mhp, aes(x = Depression, fill = Gender)) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 15) +
  labs(title = "Histogram of Depression Scores by Gender",
       x = "Depression Score",
       y = "Count") +
  theme_minimal()

Testing equality of variance

Since the depression scores in both groups are approximately normally distributed, we conduct an F-test to check whether the two group variances are equal.

# Conducting F-test to determine whether the sample variances are equal or not
var.test(Depression ~ Gender, data = mhp)

    F test to compare two variances

data:  Depression by Gender
F = 0.9773, num df = 543, denom df = 1237, p-value = 0.76
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.8489462 1.1295921
sample estimates:
ratio of variances 
         0.9772958 

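The F-test (p-value = 0.76) gives no evidence that the variances of the two groups differ, so the equal-variance assumption of Student’s t-test is reasonable. We nevertheless use Welch’s t-test, the default in R’s t.test(), because it does not rely on that assumption and gives almost the same result when the variances are similar.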
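
To make the decision rule explicit, the sketch below recomputes the two-sided p-value from the F statistic and degrees of freedom printed above and applies the usual 5% threshold. The variable names are illustrative; the numbers are copied from the output.

```r
# Values copied from the var.test() output above.
f_stat <- 0.9773
df_num <- 543    # Female group size minus 1
df_den <- 1237   # Male group size minus 1

# Two-sided p-value: twice the smaller tail area of the F distribution.
lower_tail <- pf(f_stat, df_num, df_den)
p_two_sided <- 2 * min(lower_tail, 1 - lower_tail)

# Decision at the 5% level: reject equal variances only if p < 0.05.
reject_equal_variances <- p_two_sided < 0.05

print(round(p_two_sided, 2))
print(reject_equal_variances)
```

A large p-value here means the data are compatible with equal variances; it does not prove that the variances are equal.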

# Conducting the t-test
t_test_gender <- t.test(Depression ~ Gender, data = mhp, alternative = "greater")

print(t_test_gender)

    Welch Two Sample t-test

data:  Depression by Gender
t = 4.6603, df = 1048.1, p-value = 1.782e-06
alternative hypothesis: true difference in means between group Female and group Male is greater than 0
95 percent confidence interval:
 1.024034      Inf
sample estimates:
mean in group Female   mean in group Male 
            15.53493             13.95153 
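
As a check on how the pieces of this output fit together, the following sketch rebuilds the one-sided p-value and the lower confidence bound from the printed t statistic, degrees of freedom, and group means. It uses only base R distribution functions; the variable names are illustrative, and small differences from the printed values come from rounding in the output.

```r
# Values copied from the Welch t-test output above.
t_stat <- 4.6603
df_welch <- 1048.1
mean_female <- 15.53493
mean_male <- 13.95153

# Difference in means and the standard error implied by t = diff / se.
mean_diff <- mean_female - mean_male
se_diff <- mean_diff / t_stat

# One-sided p-value for H1: mean_female > mean_male (upper tail only).
p_one_sided <- pt(t_stat, df_welch, lower.tail = FALSE)

# Lower bound of the one-sided 95% confidence interval; the upper bound is Inf.
ci_lower <- mean_diff - qt(0.95, df_welch) * se_diff

print(signif(p_one_sided, 4))
print(round(ci_lower, 3))
```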

Interpreting the results of the t-test

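The test statistic (t = 4.66, df ≈ 1048) and the one-sided p-value (1.782e-06) provide very strong evidence that female students have higher mean depression scores than male students. The one-sided 95% confidence interval for the difference in means runs from approximately 1.02 to infinity, so the difference is estimated to be at least about one point on the depression scale, which supports our alternative hypothesis. The sample means for female and male students are 15.53 and 13.95, respectively.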

Since the p-value is far below 0.05, we reject the null hypothesis in favor of the alternative hypothesis. The test provides strong evidence that female students report higher levels of depression than their male counterparts, so our hypothesis is supported by the sample data.
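
Statistical significance does not by itself say how large the difference is. A minimal sketch of a standardized effect size (Cohen’s d) is given below. It assumes the pooled standard deviation can be taken from the residual standard error of the baseline regression of Depression on Gender reported later in this document (6.635); the variable names are illustrative.

```r
# Group means copied from the t-test output.
mean_diff <- 15.53493 - 13.95153

# Pooled standard deviation, taken here from the residual standard error of
# lm(Depression ~ Gender) reported in the Model 1 output (an assumption of
# this sketch).
pooled_sd <- 6.635

# Cohen's d: difference in means expressed in pooled standard deviations.
cohens_d <- mean_diff / pooled_sd

# The same difference as a share of the 0 to 27 range of the depression score.
share_of_scale <- mean_diff / 27

print(round(cohens_d, 2))
print(round(100 * share_of_scale, 1))
```

By the usual rule of thumb, a d of roughly 0.2 is a small effect, so the gender gap is clearly detectable in a sample of this size but modest in magnitude.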

Model Comparison

Key Variables

Response Variable (Dependent Variable): The response variable for our model analysis is Depression. It’s a numeric variable that represents the level of depression reported by each student.

Explanatory Variable (Main Independent Variable): The main independent variable of our model is Gender, which is hypothesized to affect depression levels. In addition we will include the following control variables in our model analysis.

  • Academic_Year (ordinal categorical): We expect this variable to affect depression due to academic stress increasing with time.
  • CGPA (numeric): Studies on student mental health suggest that academic performance may be associated with mental health.
  • Scholarship_Waiver (binary categorical): Financial hardship is widely viewed as one of the factors affecting student mental health. This variable may reflect the financial stress of the students.
  • Anxiety and Stress (numeric): Psychological predictors that often co-occur with depression.

Variable Interaction

  • We consider an interaction term between Gender and Academic_Year in a later model to explore whether the effect of gender on depression varies by year.
  • No variable transformation was required at this stage since all variables were in appropriate formats (numeric or categorical). CGPA had already been converted from categories to a numeric scale during data cleaning.

Analyzing Regression Models and Comparisons

Model 1 (Baseline Model)

This is a simple model that we run to test the core hypothesis without any control variables. The basic linear regression model shows a statistically significant relationship between gender and depression scores among university students. Female students (the reference group) have an average depression score of 15.53, while male students score on average 1.58 points lower. This difference is significant (p < 0.001), indicating that female students report higher levels of depression than male students. Gender alone explains only about 1% of the variance in depression scores (R² = 0.012), so the effect is statistically clear but small.

model1 <- lm(Depression ~ Gender, data = mhp)
summary(model1)

Call:
lm(formula = Depression ~ Gender, data = mhp)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5349  -4.9515   0.0485   5.0485  13.0485 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  15.5349     0.2845  54.611  < 2e-16 ***
GenderMale   -1.5834     0.3413  -4.639 3.75e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.635 on 1780 degrees of freedom
Multiple R-squared:  0.01195,   Adjusted R-squared:  0.01139 
F-statistic: 21.52 on 1 and 1780 DF,  p-value: 3.748e-06
summary(model1)$coefficients
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 15.534926  0.2844644 54.611152 0.000000e+00
GenderMale  -1.583392  0.3412883 -4.639455 3.748303e-06
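
A regression on a single two-level factor is equivalent to an equal-variance (pooled) two-sample t-test, which is why the t value for GenderMale is so close to the Welch statistic obtained earlier. The sketch below demonstrates the equivalence on simulated data; the data, group sizes, and variable names are all made up for illustration and are not the survey data.

```r
# Simulated scores for two groups (illustrative values only).
set.seed(1)
gender <- factor(rep(c("Female", "Male"), times = c(60, 90)))
score <- rnorm(150, mean = ifelse(gender == "Female", 15.5, 14), sd = 6.6)

# Regression on the two-level factor; "Female" is the reference level.
fit <- lm(score ~ gender)

# Pooled-variance two-sample t-test on the same data.
pooled <- t.test(score ~ gender, var.equal = TRUE)

# The two t statistics agree up to sign: lm() reports Male minus Female,
# while t.test() reports Female minus Male.
lm_t <- unname(coef(summary(fit))["genderMale", "t value"])
tt_t <- unname(pooled$statistic)

print(c(lm = lm_t, t_test = tt_t))
```

The intercept of such a model is the mean of the reference group, and the slope is the difference between the two group means.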
Model 2 (Theoretical Controls Model)

The second model builds on the base model and adds all the theoretically justified control variables. This model controls for academic level, academic performance, financial aid, and mental health co-morbidities. In Model 2, the predictors explain a substantial portion of the variance in depression scores (Adjusted R² = 0.6095). However, after adjusting for academic year, CGPA, scholarship status, anxiety, and stress, gender is no longer a significant predictor of depression (p = 0.885), suggesting the initial gender difference observed in the basic model may be accounted for by these other factors. If anxiety and stress themselves differ by gender, this pattern is also consistent with the gender difference in depression operating through them. Notably, anxiety and stress are strong, highly significant predictors of depression (p < 0.001), and receiving a scholarship or waiver is also associated with higher depression scores (p = 0.009). Other control variables, including academic year and CGPA, do not show statistically significant effects.

# Analyzing the second model
model2 <- lm(Depression ~ Gender + Academic_Year + CGPA + Scholarship_Waiver + Anxiety + Stress, data = mhp)

summary(model2)

Call:
lm(formula = Depression ~ Gender + Academic_Year + CGPA + Scholarship_Waiver + 
    Anxiety + Stress, data = mhp)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.4057  -2.8023  -0.0104   2.7261  23.0171 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            1.14424    0.79159   1.445  0.14850    
GenderMale            -0.03164    0.21816  -0.145  0.88470    
Academic_Year.L       -0.12423    0.20374  -0.610  0.54210    
Academic_Year.Q       -0.14865    0.20256  -0.734  0.46315    
Academic_Year.C       -0.02907    0.19761  -0.147  0.88307    
CGPA                  -0.15075    0.21221  -0.710  0.47756    
Scholarship_WaiverYes  0.63713    0.24386   2.613  0.00906 ** 
Anxiety                0.82438    0.02356  34.991  < 2e-16 ***
Stress                 0.15030    0.01947   7.720 1.93e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.17 on 1773 degrees of freedom
Multiple R-squared:  0.6112,    Adjusted R-squared:  0.6095 
F-statistic: 348.4 on 8 and 1773 DF,  p-value: < 2.2e-16
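
The coefficient names Academic_Year.L, Academic_Year.Q, and Academic_Year.C appear because Academic_Year is an ordered factor, for which R uses orthogonal polynomial contrasts (linear, quadratic, cubic) by default rather than one dummy variable per year. The sketch below shows both codings on a toy factor; the object names are illustrative, and converting to an unordered factor is shown only as one possible way to obtain year-by-year coefficients.

```r
year_levels <- c("First Year", "Second Year", "Third Year", "Fourth Year")

# Ordered factor, as in the cleaned data.
year_ord <- factor(c("First Year", "Third Year", "Fourth Year", "Second Year"),
                   levels = year_levels, ordered = TRUE)

# Default coding for ordered factors: polynomial contrasts (.L, .Q, .C).
poly_contrasts <- contrasts(year_ord)

# Dropping the ordering gives treatment (dummy) contrasts instead, with
# "First Year" as the reference level.
year_nom <- factor(year_ord, ordered = FALSE)
treat_contrasts <- contrasts(year_nom)

print(round(poly_contrasts, 3))
print(treat_contrasts)
```

With polynomial contrasts, the .L term tests for a linear trend across the four years rather than a difference between any two particular years.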
Model 3 (Automated Variable Selection)

This model uses step-wise AIC-based selection to identify the most parsimonious model. We began the exercise with a full model including all theoretically relevant predictors. The step-wise procedure progressively removed variables whose removal lowered the AIC, ultimately resulting in a more parsimonious model with only three predictors: Scholarship_Waiver, Anxiety, and Stress. These variables showed the strongest association with depression levels and contributed most to minimizing the AIC. Gender, Academic_Year, and CGPA were excluded, suggesting that they did not add meaningful predictive value beyond what was explained by anxiety, stress, and scholarship status. This suggests that, in this data set, anxiety, stress, and scholarship status are the strongest predictors of depression.

model2 <- lm(Depression ~ Gender + Academic_Year + CGPA + Scholarship_Waiver + Anxiety + Stress, data = mhp)
step_model <- step(model2, direction = "both")
Start:  AIC=5098.19
Depression ~ Gender + Academic_Year + CGPA + Scholarship_Waiver + 
    Anxiety + Stress

                     Df Sum of Sq   RSS    AIC
- Academic_Year       3      15.7 30848 5093.1
- Gender              1       0.4 30833 5096.2
- CGPA                1       8.8 30841 5096.7
<none>                            30833 5098.2
- Scholarship_Waiver  1     118.7 30951 5103.0
- Stress              1    1036.4 31869 5155.1
- Anxiety             1   21291.7 52124 6031.8

Step:  AIC=5093.09
Depression ~ Gender + CGPA + Scholarship_Waiver + Anxiety + Stress

                     Df Sum of Sq   RSS    AIC
- Gender              1       0.5 30849 5091.1
- CGPA                1       8.2 30856 5091.6
<none>                            30848 5093.1
+ Academic_Year       3      15.7 30833 5098.2
- Scholarship_Waiver  1     125.3 30974 5098.3
- Stress              1    1050.3 31899 5150.8
- Anxiety             1   21463.9 52312 6032.3

Step:  AIC=5091.12
Depression ~ CGPA + Scholarship_Waiver + Anxiety + Stress

                     Df Sum of Sq   RSS    AIC
- CGPA                1       8.0 30857 5089.6
<none>                            30849 5091.1
+ Gender              1       0.5 30848 5093.1
+ Academic_Year       3      15.8 30833 5096.2
- Scholarship_Waiver  1     127.0 30976 5096.4
- Stress              1    1066.7 31915 5149.7
- Anxiety             1   21485.5 52334 6031.0

Step:  AIC=5089.58
Depression ~ Scholarship_Waiver + Anxiety + Stress

                     Df Sum of Sq   RSS    AIC
<none>                            30857 5089.6
+ CGPA                1       8.0 30849 5091.1
+ Gender              1       0.3 30856 5091.6
- Scholarship_Waiver  1     120.1 30977 5094.5
+ Academic_Year       3      15.2 30842 5094.7
- Stress              1    1079.6 31936 5148.9
- Anxiety             1   21478.9 52336 6029.1
model3 <- lm(Depression ~ Scholarship_Waiver + Anxiety + Stress, data = mhp)
summary(model3)

Call:
lm(formula = Depression ~ Scholarship_Waiver + Anxiety + Stress, 
    data = mhp)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.4132  -2.8221   0.0038   2.7416  22.7285 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            0.62413    0.35853   1.741   0.0819 .  
Scholarship_WaiverYes  0.62874    0.23902   2.631   0.0086 ** 
Anxiety                0.82467    0.02344  35.180  < 2e-16 ***
Stress                 0.15197    0.01927   7.887 5.35e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.166 on 1778 degrees of freedom
Multiple R-squared:  0.6109,    Adjusted R-squared:  0.6102 
F-statistic: 930.5 on 3 and 1778 DF,  p-value: < 2.2e-16
Model 4 (Interaction Model)

This model tests whether the effect of gender on depression changes across academic years. The results show that neither gender, academic year, nor their interaction terms are statistically significant predictors of depression. This indicates that the effect of gender on depression does not significantly change across academic years. However, anxiety and stress remain strong, highly significant predictors of depression (p < 0.001), and receiving a scholarship or waiver is also associated with higher depression levels (p = 0.009). The model explains a substantial portion of the variance (Adjusted R² = 0.6093) and performs nearly identically to Model 2 and the automated selection model in terms of explanatory power. This suggests that adding interaction terms does not improve the model, and that the core drivers of depression remain anxiety, stress, and financial aid status, rather than gender or academic year.

# Testing a model with potential variable interaction
model4 <- lm(Depression ~ Gender * Academic_Year + CGPA + Scholarship_Waiver + Anxiety + Stress, data = mhp)

summary(model4)

Call:
lm(formula = Depression ~ Gender * Academic_Year + CGPA + Scholarship_Waiver + 
    Anxiety + Stress, data = mhp)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.3967  -2.8199  -0.0038   2.7453  22.9156 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 1.16040    0.79675   1.456   0.1455    
GenderMale                  0.01821    0.22131   0.082   0.9344    
Academic_Year.L            -0.21279    0.38128  -0.558   0.5768    
Academic_Year.Q            -0.49818    0.36789  -1.354   0.1759    
Academic_Year.C            -0.27937    0.34813  -0.802   0.4224    
CGPA                       -0.16107    0.21344  -0.755   0.4506    
Scholarship_WaiverYes       0.64150    0.24424   2.626   0.0087 ** 
Anxiety                     0.82545    0.02359  34.995  < 2e-16 ***
Stress                      0.14860    0.01951   7.617  4.2e-14 ***
GenderMale:Academic_Year.L  0.11372    0.45064   0.252   0.8008    
GenderMale:Academic_Year.Q  0.49527    0.43833   1.130   0.2587    
GenderMale:Academic_Year.C  0.36580    0.42338   0.864   0.3877    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.171 on 1770 degrees of freedom
Multiple R-squared:  0.6117,    Adjusted R-squared:  0.6093 
F-statistic: 253.5 on 11 and 1770 DF,  p-value: < 2.2e-16
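
Rather than reading the three interaction coefficients one by one, the interaction can be tested as a block with a nested-model F-test. The sketch below shows the pattern on simulated data with no built-in interaction; the data frame `toy` and all variable names are hypothetical, and year is treated as an unordered factor to keep the example short.

```r
# Simulated, balanced toy data (not the survey data).
set.seed(2)
n <- 400
gender <- factor(rep(c("Female", "Male"), each = n / 2))
year <- factor(rep(c("First", "Second", "Third", "Fourth"), times = n / 4),
               levels = c("First", "Second", "Third", "Fourth"))
anxiety <- rnorm(n, mean = 12, sd = 5)
depression <- 1 + 0.8 * anxiety + rnorm(n, sd = 4)
toy <- data.frame(depression, gender, year, anxiety)

# Additive model and the same model with a gender-by-year interaction.
m_additive <- lm(depression ~ gender + year + anxiety, data = toy)
m_interact <- lm(depression ~ gender * year + anxiety, data = toy)

# Joint F-test of the three interaction terms (2 genders x 4 years adds 3 df).
cmp <- anova(m_additive, m_interact)
print(cmp)
```

Applied to the models in this report, the analogous call would compare model2 and model4, assuming both objects are still in the workspace.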
Checking multicollinearity

We check for multicollinearity between the two main predictors, Stress and Anxiety. Both predictors have a variance inflation factor (VIF) of about 1.68, which is well below the common threshold of 5, so there is no serious multicollinearity between Stress and Anxiety. Although Stress and Anxiety are conceptually related (both are psychological variables), they do not overlap enough to distort our proposed regression model. We can, therefore, include both predictors in the model.

model5 <- lm(Depression ~ Stress + Anxiety, data = mhp)
vif(model5)
  Stress  Anxiety 
1.680506 1.680506 
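
With only two predictors, the VIF is a direct function of their correlation: VIF = 1 / (1 - r²). The sketch below backs out the correlation implied by the reported VIF and then confirms the identity on simulated data; the simulated variables are illustrative and are not the survey data.

```r
# Correlation between Stress and Anxiety implied by the reported VIF.
vif_reported <- 1.680506
r_implied <- sqrt(1 - 1 / vif_reported)
print(round(r_implied, 3))

# Check the identity on simulated data (illustrative values only).
set.seed(3)
stress <- rnorm(500)
anxiety <- 0.6 * stress + rnorm(500, sd = 0.8)

# VIF from the auxiliary regression of one predictor on the other.
r2_aux <- summary(lm(stress ~ anxiety))$r.squared
vif_manual <- 1 / (1 - r2_aux)

# The same quantity computed from the squared correlation.
vif_from_cor <- 1 / (1 - cor(stress, anxiety)^2)

print(c(manual = vif_manual, from_cor = vif_from_cor))
```

An implied correlation of roughly 0.64 is substantial but leaves enough independent variation for both coefficients to be estimated with reasonable precision.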

Model Comparison and Interpretation

Analyzing the model performance
| Model | Description | Gender Effect (GenderMale) | Statistical Significance | Interpretation |
|---|---|---|---|---|
| Model 1 | Depression ~ Gender | -1.58 | Significant (p < 0.001) | Female students report significantly higher depression |
| Model 2 | Gender + theoretical controls | -0.03 | Not significant (p = 0.885) | Gender effect disappears after adjusting for Anxiety, Stress, etc. |
| Model 3 | Automated variable selection (no Gender) | Not included | N/A | Gender excluded; it did not improve model fit |
| Model 4 | Interaction: Gender * Academic_Year | 0.02 (main effect) | Not significant (p = 0.934) | No evidence that gender’s effect varies across academic years |
Model fit comparison
Model Adjusted R-squared AIC Interpretation
Model 1 0.011 ~5098 Low explanatory power
Model 2 0.6095 ~5098 Added predictors improve explanatory power
Model 3 0.6096 ~5089 Best AIC and more parsimonious than Model 2
Model 4 0.6093 ~5100 Adding interaction makes no improvement
Summary of model comparison

Model comparisons reveal that while Gender is a significant predictor of Depression in the unadjusted Model 1, its effect is no longer significant once control variables are added in the subsequent models. Models 2 through 4 all fit well, with Adjusted R² around 0.61, but Model 3, chosen through automated variable selection, has the lowest AIC, indicating the best balance between fit and simplicity. This model retains only Anxiety, Stress, and Scholarship_Waiver as significant predictors. The interaction terms in Model 4 do not improve the model’s explanatory power. Overall, the analysis suggests that depression among students is more strongly associated with psychological and financial factors than with demographic characteristics such as gender or academic year.
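
A fit comparison of this kind can be assembled programmatically rather than by hand. The following is a minimal sketch, assuming a named list of fitted lm objects; the mtcars data and the model names m1 to m3 are placeholders and do not correspond to Models 1 to 4 above.

```r
# Illustration only: placeholder models on mtcars, not the study models.
models <- list(
  m1 = lm(mpg ~ am, data = mtcars),
  m2 = lm(mpg ~ am + wt + hp, data = mtcars),
  m3 = lm(mpg ~ wt + hp, data = mtcars)
)

# Collect adjusted R-squared and AIC for each model in one data frame
comparison <- data.frame(
  model = names(models),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  aic = sapply(models, AIC),
  row.names = NULL
)

# Sort so the model with the lowest AIC comes first
comparison <- comparison[order(comparison$aic), ]
print(comparison)
```

Computing every AIC with the same function keeps the values on one scale, which matters when models are ranked against each other.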

Final model selection

The final regression model, which includes Gender along with the key predictors chosen through automated variable selection, demonstrates strong overall fit (Adjusted R² = 0.610, AIC = 10150.66). Although Gender remains statistically non-significant (b = -0.03, p = 0.89), it is retained because of its theoretical importance to the research hypothesis. The strongest predictors of depression are Anxiety (b = 0.82, p < 0.001) and Stress (b = 0.15, p < 0.001), both positively associated with depression levels. Additionally, receiving a Scholarship/Waiver is associated with significantly higher depression scores (b = 0.63, p = 0.009), suggesting a potential link to financial or academic stress. This final model balances theoretical relevance with statistical efficiency.

# Constructing the final model
final_model <- lm(Depression ~ Gender + Scholarship_Waiver + Anxiety + Stress, data = mhp)

summary(final_model)

Call:
lm(formula = Depression ~ Gender + Scholarship_Waiver + Anxiety + 
    Stress, data = mhp)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.4002  -2.8129   0.0033   2.7406  22.7363 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            0.65295    0.41489   1.574  0.11571    
GenderMale            -0.03007    0.21767  -0.138  0.89013    
Scholarship_WaiverYes  0.62614    0.23983   2.611  0.00911 ** 
Anxiety                0.82458    0.02346  35.153  < 2e-16 ***
Stress                 0.15170    0.01937   7.830 8.33e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.167 on 1777 degrees of freedom
Multiple R-squared:  0.6109,    Adjusted R-squared:   0.61 
F-statistic: 697.5 on 4 and 1777 DF,  p-value: < 2.2e-16
# Checking model fit: AIC (adjusted R-squared is reported in the summary above)
AIC(final_model)
[1] 10150.66
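
The AIC reported here (10150.66) is on a different scale from the values of roughly 5,090 to 5,100 in the model fit comparison table. One possible explanation, offered as an assumption because the selection code is not shown in this section, is that the table values came from step(), which reports extractAIC() and drops an additive constant that AIC() keeps. For a linear model with n observations the two differ by n(1 + log 2π) + 2, which for n = 1782 is about 5059 and would put the final model near 5092 on the step() scale. The sketch below checks this identity on the built-in mtcars data, used purely as a placeholder.

```r
# Illustration only: mtcars stands in for the study data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)

aic_full <- AIC(fit)            # includes the full log-likelihood constant
aic_step <- extractAIC(fit)[2]  # scale used in step() output

# For lm fits the two differ by a constant that depends only on n
offset <- n * (1 + log(2 * pi)) + 2
print(c(AIC = aic_full, extractAIC = aic_step, difference = aic_full - aic_step))
print(offset)
```

Because the offset depends only on n, both scales rank models fitted to the same data identically; only the absolute numbers differ.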
Regression diagnostics for the final model
plot(final_model)

Residuals vs fitted plot

The residuals are fairly evenly scattered around the horizontal zero line, especially at the lower and higher ranges of fitted values. However, the spread is somewhat wider in the middle range of fitted values, which points to mild heteroscedasticity: the variance of the residuals may not be constant across all levels of predicted Depression. Overall, the plot looks acceptable and does not show severe violations of the model assumptions.

Q-Q plot

Most points lie very close to the line, especially in the center, which suggests that the residuals are approximately normally distributed. However, points at the far left and far right deviate slightly from the line (for example, observations 64, 992, and 1013). These outliers indicate somewhat heavier tails than expected under normality.

Scale-Location plot

The trend line appears fairly flat, indicating that the spread of points is relatively uniform across the range of fitted values. There is also no clear funnel shape, which would suggest heteroscedasticity. This plot therefore supports the assumption of homoscedasticity: the variance of the residuals is fairly constant across the fitted values.
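
The visual impression from the Residuals vs fitted and Scale-Location plots can be complemented with a formal test. The following is a minimal sketch of the studentized (Koenker) Breusch-Pagan test written in base R and run on mtcars as a placeholder; this test is not part of the original analysis, and the variable names are illustrative.

```r
# Illustration only: mtcars stands in for the study data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- transform(model.frame(fit), e2 = residuals(fit)^2)
aux <- lm(e2 ~ wt + hp, data = aux_data)

# Koenker's statistic: n * R^2 of the auxiliary regression,
# chi-squared with df equal to the number of predictors
n <- nobs(fit)
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(statistic = bp_stat, df = bp_df, p_value = bp_p))
```

The null hypothesis is constant residual variance, so a small p-value would point toward heteroscedasticity, while a large one is consistent with the flat trend line described above.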

Residuals vs Leverage plot

The reported Cook’s distance values (0.002–0.008) are all far below the typical threshold of 0.5, suggesting that no observation is highly influential. No points fall beyond the Cook’s distance contour lines, so the model is not heavily influenced by individual outliers. In summary, the model appears stable, although a few high-leverage observations may merit review.
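
The influence measures behind this plot can also be extracted directly to list the observations worth reviewing. The following is a minimal sketch in base R, again using mtcars as a placeholder; the 4/n and 2p/n cutoffs are common rules of thumb, not thresholds taken from this analysis.

```r
# Illustration only: mtcars stands in for the study data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)
p <- length(coef(fit))

cooks <- cooks.distance(fit)
lev <- hatvalues(fit)

# Common rules of thumb (conventions, not strict cutoffs)
influential <- names(cooks)[cooks > 4 / n]
high_leverage <- names(lev)[lev > 2 * p / n]

print(influential)
print(high_leverage)
```

Observations flagged by either rule are candidates for inspection, not automatic removal.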